While the NLP community is generally aware of resource disparities among languages, we lack research that quantifies the extent and types of such disparity. Prior surveys estimating the availability of resources based on the number of datasets can be misleading as dataset quality varies: many datasets are automatically induced or translated from English data. To provide a more comprehensive picture of language resources, we examine the characteristics of 156 publicly available NLP datasets. We manually annotate how they are created, including input text and label sources and tools used to build them, and what they study, tasks they address and motivations for their creation. After quantifying the qualitative NLP resource gap across languages, we discuss how to improve data collection in low-resource languages. We survey language-proficient NLP researchers and crowd workers per language, finding that their estimated availability correlates with dataset availability. Through crowdsourcing experiments, we identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform. We conclude by making macro and micro-level suggestions to the NLP community and individual researchers for future multilingual data development.
translated by 谷歌翻译
在拍卖领域,了解重复拍卖中学习动态的收敛属性是一个及时,重要的问题,例如在线广告市场中有许多应用程序。这项工作着重于重复的首次价格拍卖,该物品具有固定值的竞标者学会使用基于平均值的算法出价 - 大量的在线学习算法,其中包括流行的无regret算法,例如多重权重更新,并遵循扰动的领导者。我们完全表征了基于均值算法的学习动力学,从收敛到拍卖的NASH平衡方面,具有两种感觉:(1)时间平均水平:竞标者在bidiper the NASH平衡方面的回合分数,在极限中均在极限中。 ; (2)最后一题:竞标者的混合策略概况接近限制的NASH平衡。具体而言,结果取决于最高值的投标人的数量: - 如果数量至少为三个,则竞标动力学几乎可以肯定地收敛到拍卖的NASH平衡,无论是在时间平时还是在最后近期的情况下。 - 如果数字为两个,则竞标动力学几乎可以肯定会在时间平时收敛到NASH平衡,但不一定在最后近期。 - 如果数字是一个,则竞标动力学可能不会在时间平均值或最后近期的时间内收敛到NASH平衡。我们的发现为学习算法的融合动力学研究开辟了新的可能性。
translated by 谷歌翻译
A crucial issue of current text generation models is that they often uncontrollably generate factually inconsistent text with respective of their inputs. Limited by the lack of annotated data, existing works in evaluating factual consistency directly transfer the reasoning ability of models trained on other data-rich upstream tasks like question answering (QA) and natural language inference (NLI) without any further adaptation. As a result, they perform poorly on the real generated text and are biased heavily by their single-source upstream tasks. To alleviate this problem, we propose a weakly supervised framework that aggregates multiple resources to train a precise and efficient factual metric, namely WeCheck. WeCheck first utilizes a generative model to accurately label a real generated sample by aggregating its weak labels, which are inferred from multiple resources. Then, we train the target metric model with the weak supervision while taking noises into consideration. Comprehensive experiments on a variety of tasks demonstrate the strong performance of WeCheck, which achieves a 3.4\% absolute improvement over previous state-of-the-art methods on TRUE benchmark on average.
translated by 谷歌翻译
Prompting large language models has enabled significant recent progress in multi-step reasoning over text. However, when applied to text generation from semi-structured data (e.g., graphs or tables), these methods typically suffer from low semantic coverage, hallucination, and logical inconsistency. We propose MURMUR, a neuro-symbolic modular approach to text generation from semi-structured data with multi-step reasoning. MURMUR is a best-first search method that generates reasoning paths using: (1) neural and symbolic modules with specific linguistic and logical skills, (2) a grammar whose production rules define valid compositions of modules, and (3) value functions that assess the quality of each reasoning step. We conduct experiments on two diverse data-to-text generation tasks like WebNLG and LogicNLG. These tasks differ in their data representations (graphs and tables) and span multiple linguistic and logical skills. MURMUR obtains significant improvements over recent few-shot baselines like direct prompting and chain-of-thought prompting, while also achieving comparable performance to fine-tuned GPT-2 on out-of-domain data. Moreover, human evaluation shows that MURMUR generates highly faithful and correct reasoning paths that lead to 26% more logically consistent summaries on LogicNLG, compared to direct prompting.
translated by 谷歌翻译
Crowd counting is usually handled in a density map regression fashion, which is supervised via a L2 loss between the predicted density map and ground truth. To effectively regulate models, various improved L2 loss functions have been proposed to find a better correspondence between predicted density and annotation positions. In this paper, we propose to predict the density map at one resolution but measure the density map at multiple resolutions. By maximizing the posterior probability in such a setting, we obtain a log-formed multi-resolution L2-difference loss, where the traditional single-resolution L2 loss is its particular case. We mathematically prove it is superior to a single-resolution L2 loss. Without bells and whistles, the proposed loss substantially improves several baselines and performs favorably compared to state-of-the-art methods on four crowd counting datasets, ShanghaiTech A & B, UCF-QNRF, and JHU-Crowd++.
translated by 谷歌翻译
Crowd localization aims to predict the spatial position of humans in a crowd scenario. We observe that the performance of existing methods is challenged from two aspects: (i) ranking inconsistency between test and training phases; and (ii) fixed anchor resolution may underfit or overfit crowd densities of local regions. To address these problems, we design a supervision target reassignment strategy for training to reduce ranking inconsistency and propose an anchor pyramid scheme to adaptively determine the anchor density in each image region. Extensive experimental results on three widely adopted datasets (ShanghaiTech A\&B, JHU-CROWD++, UCF-QNRF) demonstrate the favorable performance against several state-of-the-art methods.
translated by 谷歌翻译
Information seeking users often pose questions with false presuppositions, especially when asking about unfamiliar topics. Most existing question answering (QA) datasets, in contrast, assume all questions have well defined answers. We introduce CREPE, a QA dataset containing a natural distribution of presupposition failures from online information-seeking forums. We find that 25% of questions contain false presuppositions, and provide annotations for these presuppositions and their corrections. Through extensive baseline experiments, we show that adaptations of existing open-domain QA models can find presuppositions moderately well, but struggle when predicting whether a presupposition is factually correct. This is in large part due to difficulty in retrieving relevant evidence passages from a large text corpus. CREPE provides a benchmark to study question answering in the wild, and our analyses provide avenues for future work in better modeling and further studying the task.
translated by 谷歌翻译
作为第一个会话级的中文数据集,Chase包含两个单独的部分,即从Scratch(Chase-C)手动构建的2,003个会话,以及从英语SPARC(Chase-T)翻译的3,456个会话。我们发现这两个部分是高度差异,并且作为培训和评估数据不兼容。在这项工作中,我们介绍了SESQL,这是中文的另一个大规模会话级文本到SQL数据集,由5,028个会话组成,所有课程都是从Scratch手动构建的。为了保证数据质量,我们采用迭代注释工作流程,以促进对先前的自然语言(NL)问题和SQL查询的紧张和及时审查。此外,通过完成所有与上下文有关的NL问题,我们获得了27,012个独立的问题/SQL对,允许SESQL用作单轮多DB文本到SQL解析的最大数据集。我们通过使用三个竞争性会话级解析器,并提供详细的分析,对SESQL进行基准测试级文本到SQL解析实验。
translated by 谷歌翻译
尽管预训练的语言模型(LMS)在许多NLP任务中都取得了重大改进,但人们越来越关注探索LMS的能力并解释其预测。但是,现有作品通常仅着眼于某些下游任务的特定功能。缺乏直接评估蒙版单词预测性能和预训练LMS的解释性的数据集。为了填补空白,我们提出了一个新颖的评估基准,以提供英语和中文注释的数据。它在多个维度(即语法,语义,知识,推理和计算)中测试LMS能力。此外,它提供了满足足够和紧凑性的仔细注释的令牌级别的理由。它包含每个原始实例的扰动实例,以便将扰动下的基本原理一致性用作忠实的指标,即解释性的观点。我们在几个广泛使用的预训练的LMS上进行实验。结果表明,他们在知识和计算的维度上表现较差。而且它们在所有维度上的合理性远非令人满意,尤其是当理由缩短时。此外,我们评估的预训练的LMS在语法感知数据上并不强大。我们将以\ url {http:// xyz}发布此评估基准,并希望它可以促进预训练的LMS的研究进度。
translated by 谷歌翻译
我们介绍了Realtime QA,这是一个动态的问答(QA)平台,该平台宣布问题并定期评估系统(此版本每周)。实时质量检查询问当前世界,质量检查系统需要回答有关新事件或信息的问题。因此,它挑战了QA数据集中的静态,常规假设,并追求瞬时应用。我们在包括GPT-3和T5在内的大型语言模型上建立了强大的基线模型。我们的基准是一项持续的努力,该初步报告在过去一个月中提出了实时评估结果。我们的实验结果表明,GPT-3通常可以根据新的退休文档正确更新其生成结果,从而突出了最新信息检索的重要性。尽管如此,我们发现GPT-3倾向于在检索文件时返回过时的答案,这些文件没有提供足够的信息来找到答案。这表明了未来研究的重要途径:开放式域质量检查系统是否可以确定无法回答的案例,并与用户甚至检索模块进行通信以修改检索结果?我们希望实时质量检查能够刺激问题答案及其他问题的瞬时应用。
translated by 谷歌翻译